Vinho Verde wine quality

# CHOOSE FILE TO ANALYSE: Please use 'Red_Wine.csv'
redwine <- read.csv("Red_Wine.csv", header = TRUE, sep = ",")

A. The Business Problem

Our winery in the Northwest of Portugal produces Vinho Verde wine. More recently we have seen the quality of our red wine decline, each year losing spots at the internationally acclaimed “Best of Vinho Verde awards”.

We are wondering whether we could improve the quality and rating of the wine by adjusting the production process to influence some of its physicochemical characteristics. Our rationale is that our grapes are the same as our competitors’, so there is certainly room for improvement. We have a limited budget for this work and would like to understand which components matter most and where to invest in order to produce better wine.

We gathered data from 1599 red Vinho Verde wines with their measurable characteristics:

  1. Fixed acidity (tartaric acid - g / dm3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily).
  2. Volatile acidity (acetic acid - g / dm3): the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste.
  3. Citric acid (g / dm3): found in small quantities, citric acid can add ‘freshness’ and flavor to wines
  4. Residual sugar (g / dm3): the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet.
  5. Chlorides (sodium chloride - g / dm3): the amount of salt in the wine.
  6. Free sulfur dioxide (mg / dm3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
  7. Total sulfur dioxide (mg / dm3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
  8. Density (g / cm3): the density of wine is close to that of water, depending on the percent alcohol and sugar content
  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
  10. Sulphates (potassium sulphate - g / dm3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
  11. Alcohol (% by volume): the percent alcohol content of the wine
  12. Quality: The expert quality rating has been calculated as the median of 3 evaluations made by experts purely based on their sensorial experience of tasting the wines.

A.1. The Summary Statistics Table for Our Dataset

The basic structure of the data is as follows:

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

A summary of the data can be found below:

fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900 Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200 Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200 Median :10.20 Median :6.000
Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539 Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500 Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000

The correlation of the data is shown in the graph below


A.2. The Business Solution Process

We are thus exploring whether there is a link between the purely physicochemical characteristics of the wines and their perceived quality, so that we can tailor our production process to consumer preferences and achieve consistent quality. To conduct this analysis, we plan to execute the following steps:

  1. Descriptive statistics of the data (preview summary above)
  2. Review the data for outliers, clean if needed, and create additional factors or variables as we see fit and select the variables that will be relevant for the model
  3. Split the data into a training and a test data set
  4. Conduct regression analysis on the training data set
  5. Explore value of further analysis (e.g., factor analysis and segmentation)
  6. Develop different models to predict the best wine
  7. Evaluate our different models using the test data set
  8. Select the model with the fewest prediction errors

Once we have the best model, we will know the characteristics that make a great wine. We will then be able to adapt our production process to produce a wine that wins awards and appeals to customers, while using our investment resources in the most effective way.


0. Understanding your data

Understanding our wine data is one of the major activities in the analysis. It involves detecting and removing errors and inconsistencies in order to improve the quality of the data, and it plays a major role in the decision-making process.

A good approach should satisfy several requirements. First of all, we have a dictionary where all the variables are explained. Then, we should detect and remove all major errors and inconsistencies from the wine data. This approach is supported by R tools to limit manual inspection and programming effort.


1 Descriptive statistics of the data

1.1 Data format checking:

The first step is to ensure that the data read from the CSV file is read in the right format.

redwine$fixed.acidity <- as.numeric(redwine$fixed.acidity)
redwine$volatile.acidity <- as.numeric(redwine$volatile.acidity)
redwine$citric.acid <- as.numeric(redwine$citric.acid)
redwine$residual.sugar <- as.numeric(redwine$residual.sugar)
redwine$chlorides <- as.numeric(redwine$chlorides)
redwine$free.sulfur.dioxide <- as.numeric(redwine$free.sulfur.dioxide)
redwine$total.sulfur.dioxide <- as.numeric(redwine$total.sulfur.dioxide)
redwine$density <- as.numeric(redwine$density)
redwine$pH <- as.numeric(redwine$pH)
redwine$sulphates <- as.numeric(redwine$sulphates)
redwine$alcohol <- as.numeric(redwine$alcohol)
redwine$quality <- as.integer(redwine$quality)

The data is now in the right format, that’s fantastic!

1.2. Summary Statistics

The wine data looks like this:

Wine 1 Wine 2 Wine 3 Wine 4 Wine 5 Wine 6 Wine 7 Wine 8 Wine 9 Wine 10
fixed.acidity 7.40 7.80 7.80 11.20 7.40 7.40 7.90 7.30 7.80 7.50
volatile.acidity 0.70 0.88 0.76 0.28 0.70 0.66 0.60 0.65 0.58 0.50
citric.acid 0.00 0.00 0.04 0.56 0.00 0.00 0.06 0.00 0.02 0.36
residual.sugar 1.90 2.60 2.30 1.90 1.90 1.80 1.60 1.20 2.00 6.10
chlorides 0.08 0.10 0.09 0.08 0.08 0.08 0.07 0.06 0.07 0.07
free.sulfur.dioxide 11.00 25.00 15.00 17.00 11.00 13.00 15.00 15.00 9.00 17.00
total.sulfur.dioxide 34.00 67.00 54.00 60.00 34.00 40.00 59.00 21.00 18.00 102.00
density 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 1.00
pH 3.51 3.20 3.26 3.16 3.51 3.51 3.30 3.39 3.36 3.35
sulphates 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.80
alcohol 9.40 9.80 9.80 9.80 9.40 9.40 9.40 10.00 9.50 10.50
quality 5.00 5.00 5.00 6.00 5.00 5.00 5.00 7.00 7.00 5.00

The basic structure of the data is as follows:

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

A summary for each variable can be found below:

fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900 Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200 Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200 Median :10.20 Median :6.000
Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539 Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500 Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000

2. Review data and select variables

2.1 Data inconsistency checking

The second step is to ensure that the data read from the CSV file does not contain any missing values. As shown in the summary data above, it does not contain any missing values. Thus, we can proceed with the study.
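The missing-value check described above can also be done explicitly rather than by reading the summary. The following is a minimal sketch in base R; the helper name `count_missing` and the toy data frame are illustrative assumptions and are not part of the original analysis.

```r
# Count missing values per column of a data frame.
count_missing <- function(df) {
  colSums(is.na(df))
}

# Illustrative toy data (NOT the wine data), with one deliberate gap in pH.
toy <- data.frame(pH = c(3.51, NA, 3.26), alcohol = c(9.4, 9.8, 9.8))
missing_per_column <- count_missing(toy)
print(missing_per_column)

# For the wine data the call would be count_missing(redwine);
# a data set without gaps yields a vector of zeros.
```

A vector of zeros for `redwine` would confirm what the summary table suggests, namely that no variable contains `NA` values.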

2.2 Histograms

We plot the histograms of the data provided:

2.3 Data correlation

Now, we will see in more detail the correlation of each variable. It will be interesting for the project. The correlation matrix is:

fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
fixed.acidity 1.00 -0.26 0.67 0.11 0.09 -0.15 -0.11 0.67 -0.68 0.18 -0.06 0.12
volatile.acidity -0.26 1.00 -0.55 0.00 0.06 -0.01 0.08 0.02 0.23 -0.26 -0.20 -0.39
citric.acid 0.67 -0.55 1.00 0.14 0.20 -0.06 0.04 0.36 -0.54 0.31 0.11 0.23
residual.sugar 0.11 0.00 0.14 1.00 0.06 0.19 0.20 0.36 -0.09 0.01 0.04 0.01
chlorides 0.09 0.06 0.20 0.06 1.00 0.01 0.05 0.20 -0.27 0.37 -0.22 -0.13
free.sulfur.dioxide -0.15 -0.01 -0.06 0.19 0.01 1.00 0.67 -0.02 0.07 0.05 -0.07 -0.05
total.sulfur.dioxide -0.11 0.08 0.04 0.20 0.05 0.67 1.00 0.07 -0.07 0.04 -0.21 -0.19
density 0.67 0.02 0.36 0.36 0.20 -0.02 0.07 1.00 -0.34 0.15 -0.50 -0.17
pH -0.68 0.23 -0.54 -0.09 -0.27 0.07 -0.07 -0.34 1.00 -0.20 0.21 -0.06
sulphates 0.18 -0.26 0.31 0.01 0.37 0.05 0.04 0.15 -0.20 1.00 0.09 0.25
alcohol -0.06 -0.20 0.11 0.04 -0.22 -0.07 -0.21 -0.50 0.21 0.09 1.00 0.48
quality 0.12 -0.39 0.23 0.01 -0.13 -0.05 -0.19 -0.17 -0.06 0.25 0.48 1.00

And the graphical representation is:

From the matrix and plot above, we can derive that:

  • Fixed Acidity
    • There is a positive correlation with citric acid. This is expected, since citric acid is one of the fixed acids in wine.
    • There is a positive correlation with density.
    • There is a significant negative correlation with pH.
  • Volatile Acidity
    • There is a strong negative correlation with citric acid.
  • Free SO2
    • There is a significant positive correlation with total SO2.
  • Density
    • There is a significant negative correlation with alcohol and pH.
    • There is a positive correlation with fixed acidity, citric acid and residual sugar
  • Quality (dependent variable)
    • Quality and alcohol are positively correlated
    • Quality is negatively correlated with volatile acidity.
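The pairs highlighted in this list can be pulled out of a correlation matrix programmatically instead of by eye. The sketch below uses only base R; the helper `strong_correlations` and the 0.5 threshold are assumptions chosen for illustration, and the toy data are synthetic.

```r
# List variable pairs whose absolute Pearson correlation reaches a threshold.
strong_correlations <- function(df, threshold = 0.5) {
  cm <- cor(df)
  # Look only at the upper triangle so that each pair is reported once.
  idx <- which(upper.tri(cm) & abs(cm) >= threshold, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r = round(cm[idx], 2),
    stringsAsFactors = FALSE
  )
}

# Synthetic toy data (NOT the wine data): y follows x closely, z does not.
x <- 1:10
z <- rep(c(1, -1), times = 5)
toy <- data.frame(x = x, y = 2 * x + z, z = z)
strong_pairs <- strong_correlations(toy)
print(strong_pairs)

# For the wine data the call would be strong_correlations(redwine),
# run while every column is still numeric.
```

Lowering `threshold` widens the list to include weaker relationships.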

2.4 Plots

2.4.1 Quality vs. other independent variables

And now we plot the rest of the variables against the quality:

2.4.2 Additional plots

In this section, we plot other graphs to see how the data is distributed based on the correlation matrix

2.5 Selection of variables

The main objective of our exercise is to help identify poor quality wine based on its chemical attributes. Poor quality, or faulty wines, have been defined in our dataset based on their quality:

Wines with a quality below 5 are considered bad wines - we do not want to sell these wines to the public. This category is of primary interest in our study.

In addition to this, there is a clear distinction between the number of wines classified with quality 5 & 6 and between the number of wines classified as over 6. Thus, a good way of classification is as follows: average wines are those wines whose quality is between 5 and 6 and good wines are those whose quality is above 6.

Classification   Quality        # Occurrences in estimation data
Faulty           4 or less      63 (4%)
Average          5 & 6          1319 (82%)
Good             7 or greater   217 (14%)
Total            -              1599
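The three classes can be derived from the quality score with a few lines of base R. This is a sketch under assumptions: the helper name `classify_taste` is hypothetical, and the level order (average first) is chosen so that "average" becomes the reference class, which matches the baseline used in the regression output later in this report.

```r
# Map the integer quality score to the three classes defined above.
classify_taste <- function(quality) {
  taste <- ifelse(quality <= 4, "faulty",
                  ifelse(quality <= 6, "average", "good"))
  # "average" is listed first so that it becomes the reference level.
  factor(taste, levels = c("average", "faulty", "good"))
}

# Quality scores of the first ten wines shown in the data preview.
example_quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
example_taste <- classify_taste(example_quality)
print(table(example_taste))

# For the wine data: redwine$taste <- classify_taste(redwine$quality)
```

Applied to the full data set, `table(redwine$taste)` should reproduce the class counts in the table above.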

Faulty wine

min 25 percent median mean 75 percent max std
fixed.acidity 4.60 6.80 7.50 7.87 8.40 12.50 1.65
volatile.acidity 0.23 0.56 0.68 0.72 0.88 1.58 0.25
citric.acid 0.00 0.02 0.08 0.17 0.27 1.00 0.21
residual.sugar 1.20 1.90 2.10 2.68 2.95 12.90 1.72
chlorides 0.04 0.07 0.08 0.10 0.09 0.61 0.08
free.sulfur.dioxide 3.00 5.00 9.00 12.06 15.50 41.00 9.08
total.sulfur.dioxide 7.00 13.50 26.00 34.44 48.00 119.00 26.40
density 0.99 1.00 1.00 1.00 1.00 1.00 0.00
pH 2.74 3.30 3.38 3.38 3.50 3.90 0.18
sulphates 0.33 0.50 0.56 0.59 0.60 2.00 0.22
alcohol 8.40 9.60 10.00 10.22 11.00 13.10 0.92
quality 3.00 4.00 4.00 3.84 4.00 4.00 0.37

Average wine

min 25 percent median mean 75 percent max std
fixed.acidity 4.70 7.10 7.80 8.25 9.10 15.90 1.68
volatile.acidity 0.16 0.41 0.54 0.54 0.64 1.33 0.17
citric.acid 0.00 0.09 0.24 0.26 0.40 0.79 0.19
residual.sugar 0.90 1.90 2.20 2.50 2.60 15.50 1.40
chlorides 0.03 0.07 0.08 0.09 0.09 0.61 0.05
free.sulfur.dioxide 1.00 8.00 14.00 16.37 22.00 72.00 10.49
total.sulfur.dioxide 6.00 24.00 40.00 48.95 65.00 165.00 32.71
density 0.99 1.00 1.00 1.00 1.00 1.00 0.00
pH 2.86 3.21 3.31 3.31 3.40 4.01 0.15
sulphates 0.37 0.54 0.61 0.65 0.70 1.98 0.17
alcohol 8.40 9.50 10.00 10.25 10.90 14.90 0.97
quality 5.00 5.00 5.00 5.48 6.00 6.00 0.50

Good wine

min 25 percent median mean 75 percent max std
fixed.acidity 4.90 7.40 8.70 8.85 10.10 15.60 2.00
volatile.acidity 0.12 0.30 0.37 0.41 0.49 0.92 0.14
citric.acid 0.00 0.30 0.40 0.38 0.49 0.76 0.19
residual.sugar 1.20 2.00 2.30 2.71 2.70 8.90 1.36
chlorides 0.01 0.06 0.07 0.08 0.08 0.36 0.03
free.sulfur.dioxide 3.00 6.00 11.00 13.98 18.00 54.00 10.23
total.sulfur.dioxide 7.00 17.00 27.00 34.89 43.00 289.00 32.57
density 0.99 0.99 1.00 1.00 1.00 1.00 0.00
pH 2.88 3.20 3.27 3.29 3.38 3.78 0.15
sulphates 0.39 0.65 0.74 0.74 0.82 1.36 0.13
alcohol 9.20 10.80 11.60 11.52 12.20 14.00 1.00
quality 7.00 7.00 7.00 7.08 7.00 8.00 0.28

3. Split the data into a training and a test data set

library(caret) # provides createDataPartition()

set.seed(1985) # set a random number generation seed to ensure that the split is the same every time

# Roughly 80% of the wines are used for training (estimation) and 20% for testing.
redwine_split <- createDataPartition(y = redwine$quality,
                               p = 1298/1599, list = FALSE)

training_redwine <- redwine[redwine_split, ]
testing_redwine <- redwine[-redwine_split, ]

4. Analysis process

The analysis process carried out is based on the 6-step process provided in class. We decided that we would use an implementation of Breiman and Cutler’s Random Forests for Classification and Regression. As a result we did not have to restrict ourselves to a binary dependent variable.

4.1 Multinomial logistic regression

4.1.1 Simple multinomial logistic regression

The first test we are going to do is the multinomial logistic regression. The idea is simple: we will try to derive the taste quality parameter from the 11 physicochemical independent variables. A summary of the regression is shown below.

## Call:
## multinom(formula = taste ~ fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + density + pH + sulphates + alcohol, 
##     data = training_redwine)
## 
## Coefficients:
##        (Intercept) fixed.acidity volatile.acidity citric.acid
## faulty    405.6065     0.6609973         4.697646   0.9022834
## good      165.1007     0.2172616        -2.580776   0.5805793
##        residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## faulty      0.3739949  7.039428        -0.025045965          -0.01196597
## good        0.2213255 -8.453456         0.008314928          -0.01831228
##          density        pH sulphates    alcohol
## faulty -435.8021 7.1278877 -1.142488 -0.6364384
## good   -179.8604 0.4315777  3.755149  0.7501815
## 
## Std. Errors:
##        (Intercept) fixed.acidity volatile.acidity citric.acid
## faulty    3.043451    0.16482994        0.9103453   1.3053491
## good      1.871254    0.08843828        0.8479439   0.9097131
##        residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## faulty     0.08221751  3.288558          0.02391135          0.008781613
## good       0.06611148  3.531090          0.01358676          0.005692609
##         density        pH sulphates    alcohol
## faulty 2.967102 1.5336428 1.3665900 0.18444246
## good   1.825386 0.9346179 0.5627972 0.09671016
## 
## Residual Deviance: 1073.626 
## AIC: 1121.626
Coefficients for Simple multinomial regression
(Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
faulty 405.6065 0.6609973 4.697646 0.9022834 0.3739949 7.039428 -0.0250460 -0.0119660 -435.8021 7.1278877 -1.142488 -0.6364384
good 165.1007 0.2172616 -2.580776 0.5805793 0.2213255 -8.453456 0.0083149 -0.0183123 -179.8604 0.4315777 3.755149 0.7501815

As multinomial logistic regression does not provide the p-values, we calculate them from the z-scores (each coefficient divided by its standard error), treating these as approximately standard normal. The calculated p-values are as follows:

P-value table for Simple multinomial regression
(Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
faulty 0 0.0000607 0.0000002 0.4894273 0.0000054 0.0323078 0.2948916 0.1730033 0 0.0000034 0.4031472 0.0005593
good 0 0.0140240 0.0023379 0.5233433 0.0008147 0.0166654 0.5405461 0.0012961 0 0.6442469 0.0000000 0.0000000
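As a check on how such p-values arise, the sketch below computes two-sided Wald p-values from a coefficient and its standard error. The helper name `wald_p_values` is an assumption; the four numbers are copied from the "faulty" row of the regression output above.

```r
# Two-sided p-values from Wald z-scores (coefficient / standard error).
wald_p_values <- function(coefficients, std_errors) {
  z <- coefficients / std_errors
  2 * (1 - pnorm(abs(z)))
}

# Two entries copied from the "faulty" row of the output above.
coefs <- c(fixed.acidity = 0.6609973, citric.acid = 0.9022834)
ses   <- c(fixed.acidity = 0.16482994, citric.acid = 1.3053491)
p_values <- wald_p_values(coefs, ses)
print(signif(p_values, 3))
```

With a fitted `multinom` object (called `fit` here purely for illustration), the same function could be applied to the whole table via `s <- summary(fit)` followed by `wald_p_values(s$coefficients, s$standard.errors)`.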

The model summary output has a block of coefficients and a block of standard errors. Each of these blocks has one row of values corresponding to a model equation. Focusing on the block of coefficients, the first row compares taste = “faulty” to our baseline taste = “average”, and the second row compares taste = “good” to the same baseline. If we call the coefficients from the first row b1 and those from the second row b2, and write x_1, …, x_11 for the eleven predictors in the order shown in the output, we can write our model equations as follows:

\[\ln\left(\dfrac{P(taste=faulty)}{P(taste=average)}\right)=b_{10}+b_{11}x_{1}+b_{12}x_{2}+\dots+b_{1,11}x_{11}\]

\[\ln\left(\dfrac{P(taste=good)}{P(taste=average)}\right)=b_{20}+b_{21}x_{1}+b_{22}x_{2}+\dots+b_{2,11}x_{11}\]

We can also use predicted probabilities to help understand the model. We can calculate predicted probabilities for each of our outcome levels using the fitted function. We start by generating the predicted probabilities for the observations in our dataset and viewing the first few rows:

average faulty good Taste from original data (1 is average, 2 is faulty and 3 is good)
0.8830922 0.1084340 0.0084737 1
0.9620570 0.0295885 0.0083545 1
0.9557823 0.0324463 0.0117714 1
0.9227038 0.0115456 0.0657506 1
0.8830922 0.1084340 0.0084737 1
0.9126397 0.0786353 0.0087250 1
0.9640966 0.0286594 0.0072439 1
0.8979686 0.0796937 0.0223377 3
0.9224107 0.0544786 0.0231107 3
0.9089993 0.0137815 0.0772192 1
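The link between the two log-odds equations and the three probabilities in each row can be made explicit. The sketch below converts log-odds against the baseline class into class probabilities; the helper `log_odds_to_probs` and the example numbers are hypothetical and are not taken from the fitted model.

```r
# Convert log-odds against the baseline class into class probabilities.
# `log_odds` has one row per wine and one column per non-baseline class.
log_odds_to_probs <- function(log_odds, baseline = "average") {
  odds <- cbind(1, exp(log_odds))  # the baseline has odds exp(0) = 1
  colnames(odds) <- c(baseline, colnames(log_odds))
  odds / rowSums(odds)             # each row now sums to 1
}

# Hypothetical log-odds for two wines (illustrative numbers only).
example_log_odds <- matrix(c( 0, 0,
                             -2, 1),
                           nrow = 2, byrow = TRUE,
                           dimnames = list(NULL, c("faulty", "good")))
example_probs <- log_odds_to_probs(example_log_odds)
print(round(example_probs, 3))
```

When both log-odds are zero, the three classes are equally likely, which is why the first example row evaluates to one third for each class.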

We can predict the test values based on this regression (rows are the predicted classes, columns the actual classes):

average faulty good
average 248 7 30
faulty 1 0 0
good 4 0 10

Based on the outputs, we have a misclassification error of 14%. Also, we can plot the ROC curve. The ROC curve illustrates the performance of a binary classifier system as its discrimination threshold varies. The curve shows the true positive rate against the false positive rate at various threshold settings. The true-positive rate is also known as recall and the false-positive rate is also known as the fall-out or probability of false alarm.

The area under the curve is 88.0051764%.

Note: Display AUC value: 90+% - excellent, 80-90% - very good, 70-80% - good, 60-70% - so so, below 60% - not much value.
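The misclassification error quoted above can be reproduced from the prediction table for the simple regression. In this sketch the helper name `misclassification_rate` is an assumption, and so is the orientation of the table (rows as predictions, columns as actual classes); the error rate itself does not depend on the orientation.

```r
# Share of test wines that fall off the diagonal of a confusion matrix.
misclassification_rate <- function(confusion) {
  1 - sum(diag(confusion)) / sum(confusion)
}

# Counts copied from the simple-regression prediction table above.
classes <- c("average", "faulty", "good")
simple_confusion <- matrix(c(248, 7, 30,
                               1, 0,  0,
                               4, 0, 10),
                           nrow = 3, byrow = TRUE,
                           dimnames = list(predicted = classes,
                                           actual = classes))
simple_error <- misclassification_rate(simple_confusion)
print(simple_error)
```

Of the 300 test wines, 258 lie on the diagonal, which gives the 14% error reported for this model.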

4.1.2 Stepwise multinomial logistic regression

In this case, we use stepwise regression to find the best parameters. The summary of the simulation is as follows

Coefficients table for Stepwise multinomial regression
(Intercept) fixed.acidity volatile.acidity residual.sugar chlorides total.sulfur.dioxide density pH sulphates alcohol
faulty 188.7181 0.5036951 4.426422 0.2642166 7.604152 -0.0171398 -214.0411 5.8123193 -1.536922 -0.4210022
good 240.5433 0.3228469 -2.774771 0.2518946 -7.486849 -0.0156194 -256.9891 0.7773722 3.821127 0.7003148

Again, we need to calculate the new p-values based on the previous results

P-value table for Stepwise multinomial regression
(Intercept) fixed.acidity volatile.acidity residual.sugar chlorides total.sulfur.dioxide density pH sulphates alcohol
faulty 0 3.29e-04 2.00e-07 0.0013179 0.0192639 0.0047548 0 0.0001573 0.2662019 0.019449
good 0 1.14e-05 7.59e-05 0.0000831 0.0264939 0.0001211 0 0.3972321 0.0000000 0.000000

We can predict the test values based on the stepwise multinomial regression (rows are the predicted classes, columns the actual classes):

average faulty good
average 246 7 30
faulty 1 0 0
good 6 0 10

Based on the outputs, we have a misclassification error of 14.67%. Also, we can plot the ROC curve. The ROC curve illustrates the performance of a binary classifier system as its discrimination threshold varies. The curve shows the true positive rate against the false positive rate at various threshold settings. The true-positive rate is also known as recall and the false-positive rate is also known as the fall-out or probability of false alarm.

The area under the curve is 87.9951056%.

4.1.3 Summary of the results

With the simple multinomial regression taking into account all the variables, we are able to predict the quality of wine with a misclassification error of 14%. Based on the p-values, the parameters that characterize a good wine are as follows:

  • Fixed acidity: Increase fixed acidity
  • Volatile acidity: Decrease volatile acidity
  • Residual sugar: Increase residual sugar
  • Density: Reduce the density
  • Chlorides: Reduce chlorides
  • Total SO2: Reduce total SO2
  • Sulphates: Increase the sulphates
  • Alcohol: Increase the alcohol

Also, based on the results, the key differences between a good wine and a bad wine are found in the total SO2, sulphates and alcohol levels. We will need to pay attention to those values.

4.2 Classification and Interpretation: Random Forest tree

4.2.1 Refining the parameters

It is time now to run a classification algorithm on the data set. We have chosen to use the random forest tree algorithm for this.

First of all, we have a look at the confusion matrix results:

Predicted average Predicted faulty Predicted good Class error
Actual average 506 257 303 0.53
Actual faulty 15 35 6 0.38
Actual good 7 8 162 0.08

And now the same confusion matrix from a percentage-in-class perspective (rows sum to 100%):

Predicted average Predicted faulty Predicted good
Actual average 47.47% 24.11% 28.42%
Actual faulty 26.79% 62.5% 10.71%
Actual good 3.95% 4.52% 91.53%
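The percentage view and the class-error column can both be derived from the counts in the first confusion matrix. The sketch below uses base R only; the helper name `row_percentages` is an assumption, and the counts are copied from the table above.

```r
# Express each row of a confusion matrix as percentages of that row.
row_percentages <- function(confusion) {
  round(100 * prop.table(confusion, margin = 1), 2)
}

# Counts copied from the confusion matrix above (rows: actual class).
classes <- c("average", "faulty", "good")
train_confusion <- matrix(c(506, 257, 303,
                             15,  35,   6,
                              7,   8, 162),
                          nrow = 3, byrow = TRUE,
                          dimnames = list(actual = classes,
                                          predicted = classes))
train_percent <- row_percentages(train_confusion)
print(train_percent)

# Class error: share of each actual class that was predicted as another class.
class_error <- 1 - diag(train_confusion) / rowSums(train_confusion)
print(round(class_error, 2))
```

The class error is simply one minus the diagonal share of each row, so the two tables carry the same information.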

This is what the error looks like:

After several trials, it seems that the error tends to stabilize after about 80 trees. We have selected 128 trees.

For the final version of the model, we tried several combinations until we found the ones we liked:

Parameter Value (average, faulty, good)
classwt 10^{-5}, 1, 1
sampsize 56, 56, 56
cutoff 0.4, 0.3, 0.3
mtry 9
ntree 256
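For readers who want to see how these settings map onto code, the sketch below passes them to `randomForest()` from the randomForest package. Several points are assumptions rather than the authors' confirmed setup: the wrapper name `fit_wine_forest`, the use of the x/y interface, stratified sampling on `taste` (suggested by the three equal `sampsize` values), and the class order average, faulty, good for `classwt` and `cutoff`.

```r
# Sketch: fit a random forest with the parameter values from the table above.
# `train` is assumed to contain a factor column `taste` with levels
# average, faulty, good; `predictors` names the chemical columns to use.
fit_wine_forest <- function(train, predictors) {
  if (!requireNamespace("randomForest", quietly = TRUE)) {
    stop("Package 'randomForest' is required for this sketch.")
  }
  randomForest::randomForest(
    x = train[, predictors],
    y = train$taste,
    ntree = 256,                  # number of trees
    mtry = 9,                     # variables tried at each split
    classwt = c(1e-5, 1, 1),      # class priors (average, faulty, good)
    strata = train$taste,         # assumption: sample within each class
    sampsize = c(56, 56, 56),     # draws per class for every tree
    cutoff = c(0.4, 0.3, 0.3),    # vote thresholds; these must sum to 1
    importance = TRUE             # keep variable importance measures
  )
}

# Hypothetical usage:
# predictors <- setdiff(names(training_redwine), c("quality", "taste"))
# wine_forest <- fit_wine_forest(training_redwine, predictors)
```

Excluding `quality` from the predictors matters here, because `taste` is derived directly from it.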

4.2.2 Prediction based on the parameters

After running the prediction on the testing sample, the following confusion matrix is obtained:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction average faulty good
##    average     109      2    2
##    faulty       64      3    0
##    good         80      2   38
## 
## Overall Statistics
##                                         
##                Accuracy : 0.5           
##                  95% CI : (0.442, 0.558)
##     No Information Rate : 0.8433        
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0.1985        
##                                         
##  Mcnemar's Test P-Value : <2e-16        
## 
## Statistics by Class:
## 
##                      Class: average Class: faulty Class: good
## Sensitivity                  0.4308       0.42857      0.9500
## Specificity                  0.9149       0.78157      0.6846
## Pos Pred Value               0.9646       0.04478      0.3167
## Neg Pred Value               0.2299       0.98283      0.9889
## Prevalence                   0.8433       0.02333      0.1333
## Detection Rate               0.3633       0.01000      0.1267
## Detection Prevalence         0.3767       0.22333      0.4000
## Balanced Accuracy            0.6729       0.60507      0.8173
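The per-class figures in this output follow directly from the confusion matrix. The sketch below recomputes accuracy, sensitivity and specificity in base R from the counts shown above; the helper name `class_statistics` is an assumption.

```r
# Per-class sensitivity and specificity from a confusion matrix that has
# predictions in rows and reference (actual) classes in columns.
class_statistics <- function(confusion) {
  total <- sum(confusion)
  tp <- diag(confusion)               # correctly predicted per class
  fn <- colSums(confusion) - tp       # actual members that were missed
  fp <- rowSums(confusion) - tp       # wrongly assigned to the class
  tn <- total - tp - fn - fp
  data.frame(
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    row.names = rownames(confusion)
  )
}

# Counts copied from the output above.
classes <- c("average", "faulty", "good")
test_confusion <- matrix(c(109, 2,  2,
                            64, 3,  0,
                            80, 2, 38),
                         nrow = 3, byrow = TRUE,
                         dimnames = list(prediction = classes,
                                         reference = classes))
test_accuracy <- sum(diag(test_confusion)) / sum(test_confusion)
test_stats <- class_statistics(test_confusion)
print(test_accuracy)
print(round(test_stats, 4))
```

The high sensitivity for the good class combined with the low sensitivity for the average class is what produces the overall accuracy of 0.5.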

The ROC curve is as follows.

The area under the curve is 94.270975%.

4.2.3 Summary of the random forest

The model is strongly biased towards the faulty classification. The model is pretty good at predicting the good wines, and this is exactly what we are looking for. Although the model might seem to be terrible at dealing with the average wines (usually splitting them 47% average, 24% faulty, 28% good), we are not concerned with upward misclassifications. Also, with such an abundance of average wines, losing out on 47% of them is not so terrible. The main objective is to select the best wines to increase our revenues.

The variable importance is as follows:

average faulty good MeanDecreaseAccuracy MeanDecreaseGini
fixed.acidity -1.1789124 0.4207999 4.259495 -0.0313091 7.106187
volatile.acidity -5.5043443 18.3962683 15.377489 0.3575496 80.155634
citric.acid -2.5368459 1.5313713 7.474859 0.5444969 6.786713
residual.sugar 1.4540181 2.0167245 3.769658 2.9965290 11.190058
chlorides 1.5637435 2.1398075 5.241920 3.0042070 14.588990
free.sulfur.dioxide 0.8634829 9.3848467 5.325814 2.9541962 11.618034
total.sulfur.dioxide 4.7551936 8.4964902 11.070590 8.5796792 4.927110
density 2.1564976 2.7373943 5.800168 3.7947373 10.070354
pH -5.1593009 8.2437160 6.623883 -2.8505398 5.564237
sulphates -8.4061864 18.3962171 29.648546 5.7976396 102.364189
alcohol 0.9503377 7.0952271 35.577178 19.1527096 57.833428
And with the graphical representation:

Based on the importance, the parameters that characterize a good wine are as follows (the directions are those found by the regression in 4.1, since the importance measures carry no sign):

  • Alcohol: Increase the alcohol
  • Total SO2: Reduce total SO2
  • Sulphates: Increase the sulphates
  • Residual sugar: Increase residual sugar
  • Density: Reduce the density
  • Volatile acidity: Decrease volatile acidity

5. Summary and conclusions

We have run two different methods to find the best predictors for our wine: multinomial logistic regression and random forest. The outcomes of the two methods are broadly similar.

On the one hand, the multinomial logistic regression model is pretty good, as we get a misclassification error of 14%. This means that we are able to classify the wines pretty well. The results also show that these are the most important parameters to consider when making a good wine:

  • Fixed acidity: Increase fixed acidity
  • Volatile acidity: Decrease volatile acidity
  • Residual sugar: Increase residual sugar
  • Density: Reduce the density
  • Chlorides: Reduce chlorides
  • Total SO2: Reduce total SO2
  • Sulphates: Increase the sulphates
  • Alcohol: Increase the alcohol

We should also not forget that the key differences between a good wine and a bad wine are found in the total SO2, sulphates and alcohol levels. We will need to pay attention to those values.

On the other hand, the random forest method is strongly biased towards the faulty classification. However, the model is pretty good at predicting the good wines, and this is exactly what we are looking for. The results show that the parameters that characterize a good wine are as follows:

  • Alcohol
  • Total SO2
  • Sulphates
  • Residual sugar
  • Density
  • Volatile acidity

As can be seen, both studies give similar results in terms of parameters. However, the random forest cannot tell us whether we need to increase or decrease these values to make a better wine.

And now comes the fun part: do these parameters make sense? They do!

Total acidity in wine is known as titratable acidity, and is the sum of the fixed and volatile acids. Total acidity directly affects the color and flavor of wine and, depending on the style of the wine, is sought in a perfect balance with the sweet and bitter sensations of other components. The regression says that, for a better taste, the acidity of the wine should come more from fixed than from volatile acids, which means a stronger sweet taste. This is aligned with what the regression states about residual sugar and chlorides: a wine has better quality with a higher proportion of sugar and a lower proportion of salt.

Sulphates are a preservative that is widely used in winemaking (and in most food industries) for its antioxidant and antibacterial properties, and they play an important role in preventing oxidation and maintaining a wine’s freshness. But in high amounts, the SO2 they contribute to can have an unpleasant smell and taste; that is why the regression says that a certain increase in sulphates is good but also that an increase in SO2 reduces the quality of the wine.

In wine, alcohol and density are negatively correlated. While water has a density of 1 gram per cubic centimeter, alcohol has a density of about 0.79 g/cc, so the more alcohol a wine contains relative to other liquids, the lower its overall density should be. This is aligned with the regression as well, which shows that better quality is associated with more alcohol and hence lower density.

Good wine :)

You cannot buy happiness, … but you can buy a good wine and that is kind of the same thing.